Abstract

While existing face recognition systems based on local 
features are robust to issues such as misalignment, they 
can exhibit accuracy degradation when comparing im- 
ages of differing resolutions. This is common in surveil- 
lance environments where a gallery of high resolution 
mugshots is compared to low resolution CCTV probe 
images, or where the size of a given image is not a reli- 
able indicator of the underlying resolution (eg. poor op- 
tics). To alleviate this degradation, we propose a com- 
pensation framework which dynamically chooses the 
most appropriate face recognition system for a given 
pair of image resolutions. This framework applies a 
novel resolution detection method which does not rely 
on the size of the input images, but instead exploits the 
sensitivity of local features to resolution using a proba- 
bilistic multi-region histogram approach. Experiments 
on a resolution-modified version of the "Labeled Faces
in the Wild" dataset show that the proposed resolution
detector frontend obtains a 99% average accuracy in 
selecting the most appropriate face recognition system, 
resulting in higher overall face discrimination accuracy 
(across several resolutions) compared to the individual 
baseline face recognition systems. 



1 Introduction 

Face images obtained in surveillance scenarios typi- 
cally have issues such as misalignment and variations 
in pose and illumination. Here we address a further 
issue, namely varying image resolution [9], encoun- 
tered while undergoing real-world system trials for the 
UK police and other agencies. Mismatched resolutions 
between probe and gallery images can cause signifi- 
cant performance degradation for face recognition sys- 
tems, particularly those which use high-resolution faces 
(eg. mugshots or passport photos) as gallery images. 
Another source of resolution mismatch is that the size
(in pixels) of a given face image may not be a reliable
indicator of the underlying
optical resolution. Examples include: (i) poor quality 
optics in low-cost cameras can act as low-pass filters; 
(ii) poor focus and over-exposure can result in blur and 
loss of detail; (iii) a given gallery face is provided in an 
already resized form and the original size is unknown 
(eg. digital scan of a photograph). 

Face recognition approaches can be placed into 
two general families: holistic and local-feature based. 
In typical holistic approaches, a single feature vector 
describes the entire face and the spatial relations be- 
tween face characteristics (eg. eyes) are rigidly kept. 
Examples of such systems include PCA and Fisher- 
faces [2]. In contrast, local-feature based approaches 
describe each face as a set of feature vectors (with each 
vector describing a small part of the face), with re- 
laxed constraints on the spatial relations between face 
parts [4]. Examples include systems based on elastic 
graph matching, hidden Markov models (HMMs) and 
Gaussian mixture models (GMMs) [4]. 

Local-feature based approaches have the advantage 
of being considerably more robust against misalignment 
(caused by automatic face detectors) as well as varia- 
tions in illumination and pose [4, 11]. As such, these 
systems are more suitable for dealing with faces ob- 
tained in surveillance contexts. However, almost all 
of the literature on addressing resolution mismatches 
(eg. [5, 7]) deals with holistic approaches and naively 
assumes that faces are localised perfectly (ie. no mis- 
alignment) as well as being frontal (ie. no pose varia- 
tions). 

In typical local-feature based face recognition sys- 
tems, the size of probe and gallery face images must be 
the same prior to feature extraction [3]. As such, the
given faces are normally resized to a common intermediate
format (IF) prior to further processing (eg. low-resolution
faces are upscaled while high-resolution faces are
downscaled), and recognition systems are often tuned
to work with that particular image size. The
use of IF processing leads to three problems in mis- 
matched resolution comparisons, elucidated below. 

(i) For low-resolution images, upscaling does not
introduce any new information, and can potentially
introduce artifacts or noise. Also, upscaled images are
blurry (Fig. 1), which causes the extracted features to be
very different from those obtained from the downscaled
high-resolution faces, resulting in a significant drop in
recognition accuracy [5]. Thus upscaling is generally not
a good solution to the low-to-high resolution mismatch
problem. It might be tempting to employ techniques such
as super-resolution or hallucination [1]. However,
super-resolution requires several images (which may not
be available) in addition to precise alignment [9].

(ii) Prima facie, if upscaling is not a good solution, 
one may think that downscaling high-resolution images 
will solve the resolution mismatch issue. However, 
downscaling reduces the amount of information avail- 
able, thereby reducing the recognition performance. 
Situations can arise where the given probe face image 
is larger than the IF image size (eg. obtained through 
a telephoto lens). To allow maximum accuracy wherever
possible, the recognition system should ideally be able
to detect when a high-to-high resolution comparison is
possible (ie. with a larger IF) and when it should instead
perform a low-to-high resolution comparison (ie. with a
smaller IF). Typically, one IF processing chain alone is
not sufficient to achieve this.

(iii) Resizing presupposes that the original sizes of
the given images are an indicator of the underlying
resolutions. This is often not the case in the poorly
controlled image datasets encountered in practice. Thus a
resolution detector is necessary to identify whether the 
underlying resolution of the probe image is high or low. 

In this paper we present a novel method to handle 
resolution mismatches for the recently proposed Multi- 
Region Histograms (MRH) local-feature approach, 
which can be thought of as a hybrid between the HMM 
and GMM based systems [12]. Specifically, we pro- 
pose: (i) the use of two IF sizes (small and large), with 
the small IF size targeted for reducing resolution mis- 
matches caused by upscaling, and the large IF size tar- 
geted for high discrimination performance when little 
to no resolution mismatches are present; (ii) a dedi- 
cated resolution detector frontend to address situations 
where the actual resolution of given faces is unknown 
(ie. where the size of given faces cannot be relied upon 
to determine the resolution); (iii) the use of the
resolution detector, as part of a resolution mismatch
compensation framework, to determine which of the two IF
image sizes to use when comparing two face images with
unknown resolutions.

We continue the paper as follows. In Section 2 we 
briefly describe the MRH-based face recognition ap- 
proach. The proposed resolution mismatch compen- 
sation framework is described in Section 3. Section 4 
presents experiments on the recent Labeled Faces in 
the Wild (LFW) dataset [8], which contains problem- 
atic face variations akin to those found in surveillance 
scenarios. The main findings are presented in Section 5.

Figure 1. Original images of varying size upscaled to a
size of 64x64 (via bilinear interpolation), resulting in
images of fixed size but with varying underlying resolution.

2 Probabilistic Multi-Region Histograms 

The MRH approach is motivated by the 'visual 
words' technique originally used in image categorisa- 
tion [10]. Each face is divided into several fixed and 
adjacent regions, with each region comprising a relatively
large part of the face. For region $r$ a set of feature
vectors is obtained, $F_r = \{ f_{r,i} \}_{i=1}^{N}$, which
are in turn attained by dividing the region into small
overlapping blocks (or patches) and extracting descriptive
features from each block via 2D DCT decomposition [6]. Each
block has a size of 8 x 8 pixels, which is the typical size 
used for DCT analysis. To account for varying contrast, 
each block is normalised to have zero mean and unit 
variance. Based on [12], coefficients from the top-left 
4x4 sub-matrix of the 8x8 DCT coefficient matrix are 
used, excluding the 0-th coefficient (which has no infor- 
mation due to the normalisation). 
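
The per-block feature extraction described above can be
sketched as follows. This is a minimal illustration rather
than the authors' implementation: the function names, the
row-major ordering of the retained coefficients and the
handling of perfectly flat blocks are assumptions.

```python
import math

KEEP = 4  # side of the retained low-frequency sub-matrix (4x4, per the text)

def normalise_block(block):
    """Scale a square block to zero mean and unit variance."""
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # assumption: leave flat blocks unscaled
    return [[(v - mean) / std for v in row] for row in block]

def dct2(block):
    """Orthonormal 2D DCT-II of a square block (direct, unoptimised form)."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out

def block_features(block):
    """15-dimensional feature vector: top-left 4x4 DCT coefficients minus the 0-th."""
    coeffs = dct2(normalise_block(block))
    return [coeffs[u][v]
            for u in range(KEEP) for v in range(KEEP)
            if (u, v) != (0, 0)]
```

Because each block is normalised before the transform, the
0-th (DC) coefficient is zero and the remaining features are
unchanged by a gain or offset applied to the block.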

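For each vector $f_{r,i}$ obtained from region $r$, a
probabilistic histogram is computed:

$$ h_{r,i} = \left[ \frac{w_1 \, p_1(f_{r,i})}{\sum_{g=1}^{G} w_g \, p_g(f_{r,i})}, \; \ldots, \; \frac{w_G \, p_G(f_{r,i})}{\sum_{g=1}^{G} w_g \, p_g(f_{r,i})} \right]^{T} \qquad (1) $$

where the $g$-th element in $h_{r,i}$ is the posterior
probability of $f_{r,i}$ according to the $g$-th component
of a visual dictionary model, with $w_g$ and $p_g(\cdot)$
denoting the weight and density of that component and $G$
the number of components. As the visual dictionary is a
mixture of Gaussians, the mean of each Gaussian can be
thought of as a particular 'visual word'.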
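
As an illustration of the probabilistic histogram in
Eqn. (1), the sketch below computes the posterior of each
dictionary component for one feature vector. The
diagonal-covariance form of the Gaussians, the function
names and the argument layout are assumptions made for the
example; the text only states that the dictionary is a
mixture of Gaussians.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def posterior_histogram(x, weights, means, variances):
    """Posterior probability of each dictionary component given vector x."""
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)  # subtract the maximum before exponentiating, for stability
    unnorm = [math.exp(value - peak) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

The returned list sums to one, so a vector lying close to a
single 'visual word' yields a histogram dominated by that
entry, while a vector midway between two words spreads its
mass across both.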

Once the histograms are computed for each feature
vector from region $r$, an average histogram for the
region is built:

$$ h_{r,\mathrm{avg}} = \frac{1}{N} \sum_{i=1}^{N} h_{r,i} \qquad (2) $$

The overlapping during feature extraction, as well as
the loss of spatial relations within each region (due to 
averaging), results in robustness to translations of the 
face which are caused by imperfect face localisation. 
The DCT decomposition acts like a low-pass filter, with 
the information retained from each block being robust 
to small alterations (eg. due to minor in-plane rotations). 
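The normalised distance between faces $X$ and $Y$ is
calculated using:

$$ d_{\mathrm{norm}}(X,Y) = \frac{ d_{\mathrm{raw}}(X,Y) }{ \frac{1}{2M} \sum_{i=1}^{M} \left[ d_{\mathrm{raw}}(X,C_i) + d_{\mathrm{raw}}(Y,C_i) \right] } \qquad (3) $$

where $C_i$ is the $i$-th cohort face and $M$ is the number
of cohorts, while $d_{\mathrm{raw}}(\cdot,\cdot)$ is an
$L_1$-norm based distance measure between histograms from
$R$ regions:

$$ d_{\mathrm{raw}}(X,Y) = \frac{1}{R} \sum_{r=1}^{R} \left\| h_{r,\mathrm{avg}}^{[X]} - h_{r,\mathrm{avg}}^{[Y]} \right\|_{1} \qquad (4) $$

Cohort faces are assumed to be reference faces that are
known not to be of persons depicted in $X$ or $Y$. The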
denominator in Eqn. (3) estimates how far away, on av- 
erage, faces X and Y are from a randomly selected face. 
This typically results in Eqn. (3) being approximately 1 
when X and Y represent faces from two different peo- 
ple, and less than 1 when X and Y represent two in- 
stances of the same person. 
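
The region averaging and the cohort-normalised distance of
Eqn. (3) can be sketched as below. Here a face is
represented as a list of R average histograms. The function
names are hypothetical, and the 1/R averaging over regions
and the 1/(2M) averaging over cohort distances are
assumptions chosen to match the description of the
denominator as an average distance to a randomly selected
face.

```python
def average_histogram(histograms):
    """Element-wise mean of the per-block histograms of one region."""
    n = len(histograms)
    return [sum(h[g] for h in histograms) / n
            for g in range(len(histograms[0]))]

def d_raw(face_x, face_y):
    """Mean L1 distance between corresponding region histograms of two faces."""
    total = 0.0
    for hx, hy in zip(face_x, face_y):
        total += sum(abs(a - b) for a, b in zip(hx, hy))
    return total / len(face_x)

def d_norm(face_x, face_y, cohort):
    """Raw distance divided by the average distance of both faces to the cohort."""
    m = len(cohort)
    denom = sum(d_raw(face_x, c) + d_raw(face_y, c) for c in cohort) / (2.0 * m)
    return d_raw(face_x, face_y) / denom
```

With this normalisation a pair of faces that are as far
apart as they are, on average, from the cohort faces scores
about 1, and closer pairs score below 1.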

3 Proposed Compensation Framework 

In order to handle resolution mismatches when the 
size of given face images cannot be relied upon as an 
indicator of the underlying resolution, it is necessary to 
analyse the content of the given images and determine 
whether the images can be downscaled to a more appro- 
priate size. 

In this work we use two IF image sizes, namely A 
and B. We define size A as 64x64 and size B as 32x32. 
It is important to note that due to the low-pass filter- 
ing effect of the DCT analysis, MRH-based recognition 
tuned for size A (where all given images are resized to 
size A) is able to handle images which have an under- 
lying resolution ranging from 32x32 to 64x64, while 
MRH-based recognition tuned for size B is suited for 
32x32 and lower resolutions (ie. 16x16 and 8x8).

The detector uses two sets of reference faces: $S_A$ and
$S_B$. In each set the faces have a canonical size of 64x64
pixels, though in each set the underlying resolution is
different. Set $S_A$ contains faces which are downscaled
versions of the underlying high resolution faces. In set
$S_B$ the underlying high resolution faces were first
downscaled to 16x16, followed by upscaling to the canonical
size (ie. deliberate loss of information).
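
The deliberate loss of information used to build the
low-resolution reference set can be illustrated with the
sketch below, which downscales a square greyscale image and
then upscales it back to the canonical size. Bilinear
upscaling follows the caption of Fig. 1; the box-average
downscaling, the function names and the list-of-lists image
representation are assumptions, as the text does not specify
them.

```python
def downscale(img, size):
    """Box-average a square image to size x size (side must be a multiple of size)."""
    factor = len(img) // size
    return [[sum(img[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(size)]
            for r in range(size)]

def upscale_bilinear(img, size):
    """Bilinear interpolation of a square image up to size x size."""
    n = len(img)
    scale = n / size
    out = []
    for r in range(size):
        # Map the output pixel centre to source coordinates, clamped to the image.
        y = min(max((r + 0.5) * scale - 0.5, 0.0), n - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, n - 1)
        fy = y - y0
        row = []
        for c in range(size):
            x = min(max((c + 0.5) * scale - 0.5, 0.0), n - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, n - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def degrade(img, underlying, canonical=64):
    """Return a canonical-size image whose detail is limited to `underlying` pixels."""
    return upscale_bilinear(downscale(img, underlying), canonical)
```

For example, a 64x64 image containing a one-pixel
checkerboard keeps its size after `degrade(img, 16)` but
becomes uniformly grey, since the pattern cannot be
represented at an underlying resolution of 16x16.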

The detector co-opts the framework and processing 
used by the MRH approach, in order to exploit the sen- 
sitivity of local DCT features to resolution mismatches. 
In essence, the detector measures whether a given 
face is more similar to either low-resolution or high- 
resolution reference faces. The processing steps are: 

1. The given face Q is rescaled to the canonical size 
(64x64), regardless of the original size of Q. 

2. MRH analysis with 3x3 regions and 1024 visual words 
is performed (using parameter settings as in [12]). 

3. The average distance of Q to faces in sets $S_A$ and
$S_B$ is found:

$$ d_{\mathrm{avg}}(Q, S_i) = \frac{1}{|S_i|} \sum_{j=1}^{|S_i|} d_{\mathrm{raw}}(Q, S_{i,j}) \qquad (5) $$

where $i \in \{A, B\}$, $S_{i,j}$ is the $j$-th face of set
$S_i$ and $|S_i|$ is the number of faces in set $S_i$.

4. The smallest average distance, either
$d_{\mathrm{avg}}(Q, S_A)$ or $d_{\mathrm{avg}}(Q, S_B)$,
indicates whether MRH tuned for size A or B, respectively,
should be used for recognition.
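
Steps 3 and 4 above can be sketched as follows. The
distance function is passed in as a parameter so that the
example stays self-contained; in the framework it would be
the raw MRH distance. The function names and the choice to
break ties in favour of size A are assumptions.

```python
def average_distance(query, reference_set, distance):
    """Mean distance from the query face to every face in a reference set."""
    return sum(distance(query, ref) for ref in reference_set) / len(reference_set)

def select_system(query, set_a, set_b, distance):
    """Return 'A' or 'B' according to the closer reference set."""
    d_a = average_distance(query, set_a, distance)
    d_b = average_distance(query, set_b, distance)
    return "A" if d_a <= d_b else "B"  # assumption: ties go to size A
```

As a toy usage example with scalar stand-ins for faces,
`select_system(11, [10, 12], [1, 3], lambda p, q: abs(p - q))`
selects size A because the query is closer on average to the
first set.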



Table 1. Classification performance of the proposed im- 
age resolution detector frontend. All given face images 
have one size (64x64) but the underlying resolution varies 
(8x8 to 64x64). Face images are classified as being suit- 
able for MRH-based face recognition using either size A 
or B. MRH tuned for size A is suitable for images with 
an underlying resolution of 32x32 or higher, while MRH 
tuned for size B is more suited for lower resolutions. 



Underlying Resolution    Size A      Size B
64x64                    99.87 %     0.13 %
32x32                    98.06 %     1.94 %
16x16                    1.94 %      98.06 %
8x8                      0.00 %      100.00 %



4 Experiments and Discussion 

We used the Labeled Faces in the Wild (LFW) 
dataset which contains 13,233 face images (from 5749 
unique persons) collected from the Internet [8]. The 
faces exhibit several compound problems such as mis- 
alignment and variations in pose, expression and illumi- 
nation. In our experiments we extracted closely cropped 
faces^ (to exclude the background) using a fixed bound- 
ing box placed in the same location in each LFW image. 

In LFW experiment protocols the task is to classify 
a pair of previously unseen faces as either belonging to 
the same person or two different persons [8]. Perfor- 
mance is indicated by the mean of the accuracies from 
10 folds of the 10 sets from view 2, in a leave-one-out 
cross-validation scheme (ie. in each fold 9 sets are used 
for training and 1 set for testing, with each set having 
300 same-person and 300 different-person pairs). 

To study the effect of resolution mismatches, the first 
image in each pair was rescaled to 64x64 while the
second image was first rescaled to a size equal to or 
smaller than 64x64, followed by upscaling to the same 
size as the first image (ie. deliberate loss of information, 
causing the image size to be uninformative as to the un- 
derlying resolution). The underlying resolution of the 
second image varied from 8x8 to 64x64. 

In experiment 1 we evaluated the classification per- 
formance of the proposed resolution detector frontend. 
Reference faces for sets $S_A$ and $S_B$ were taken from the
training set. Preliminary experiments indicated that us- 
ing 32 faces for each reference set was sufficient. The 
second image in each pair from the test set was then 
classified as being suitable for MRH-based face recog- 
nition using either size A or B. Recall that an MRH- 
based face recognition system tuned for size A is suited 
for faces which have an underlying resolution of 32x32 
or higher, while a corresponding system tuned for size B 
is more suited for lower resolutions. The results, shown 
in Table 1, indicate that the frontend detector assigns
the most suitable size with at least 98% accuracy at
every tested resolution.
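
The figures in Table 1 can be summarised by averaging the
rate of correct assignment over the four tested
resolutions, which reproduces the 99% average accuracy
quoted in the abstract. The short check below uses only the
values from Table 1; the variable names are illustrative.

```python
# Percentage of faces assigned to the suitable size, per underlying resolution
# (size A for 64x64 and 32x32, size B for 16x16 and 8x8), taken from Table 1.
correct = {"64x64": 99.87, "32x32": 98.06, "16x16": 98.06, "8x8": 100.00}

average = sum(correct.values()) / len(correct)
lowest = min(correct.values())
print(f"average correct assignment: {average:.2f} %")
print(f"lowest correct assignment:  {lowest:.2f} %")
```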



^Available from http://itee.uq.edu.au/~conrad/lfwcrop/ 



In experiment 2 we evaluated the performance of 
three MRH-based systems for classifying LFW image 
pairs subject to resolution mismatches. Systems A 
and B were tuned for size A and B, respectively, while 
the dynamic system applies the proposed compensation 
framework to switch between System A and B accord- 
ing to the classification result of the resolution detector. 

Comparing the results of the two baseline systems 
(A and B) in Table 2 confirms that System A outper- 
forms System B when matching images of similar un- 
derlying resolution (ie. 64x64 and 32x32), but signif- 
icantly underperforms System B when there is a con- 
siderable resolution mismatch (16x16 or lower). Sys- 
tem B is able to achieve more rounded performance at 
the expense of reduced accuracy for the highest resolu- 
tion (64x64). 

The proposed dynamic system is able to retain the 
best aspect of System A (ie. good accuracy at the high- 
est resolution) with performance similar to System B at 
lower resolutions. Consequently, the dynamic system 
obtains the best overall performance. 

We note that in three out of the four tested reso- 
lutions, the dynamic system slightly outperforms the 
best underlying system. Based on observations of the 
original LFW dataset, we conjecture that this outperfor- 
mance is due to a subset of LFW images already having 
a low underlying resolution. 
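
The comparison above can be verified directly from the
values in Table 2. The sketch below recomputes the
per-system averages and lists the resolutions at which the
dynamic system exceeds both baselines; the variable names
are illustrative.

```python
# Accuracy (%) per underlying resolution: (System A, System B, Dynamic System),
# taken from Table 2.
table2 = {
    "64x64": (74.25, 70.28, 74.35),
    "32x32": (70.36, 69.99, 70.47),
    "16x16": (59.35, 68.08, 67.62),
    "8x8": (53.13, 59.40, 59.90),
}

# Column-wise mean over the four resolutions.
averages = [sum(col) / len(table2) for col in zip(*table2.values())]

# Resolutions where the dynamic system beats the better of the two baselines.
wins = [res for res, (a, b, dyn) in table2.items() if dyn > max(a, b)]

print([f"{value:.2f}" for value in averages])
print(wins)
```

The dynamic system exceeds both baselines at 64x64, 32x32
and 8x8, and trails System B only at 16x16, where the
detector's 1.94% misassignment rate routes some pairs to
the weaker System A.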

5 Conclusion 

In this paper we have shown how comparing images 
with different underlying resolutions can lead to a sig- 
nificant drop in performance for a local feature based 
face recognition system, and proposed a compensation 
framework to improve overall performance (across sev- 
eral resolutions). The proposed framework relies on a 
novel resolution detector frontend which exploits the 
sensitivity of local features to resolution. The perfor- 
mance of this resolution detection and compensation 
framework was demonstrated on a resolution-modified 
Labeled Faces in the Wild [8] dataset using the Multi- 
Region Histogram based recognition system. 

In our experiments, two systems (A and B) were 
tuned to different underlying resolutions. System A, 
tuned for higher underlying resolutions, was shown to 
outperform System B when comparing images of simi- 
lar underlying resolution (64x64 and 32x32), while un- 
derperforming when comparing images of very differ- 
ent underlying resolution (16x16 and 8x8). The re- 
verse was true for System B, tuned for lower resolu- 
tions. The proposed dynamic compensation framework 
was able to maximise performance by applying the sys- 
tem best tuned for any given pair of images based on 
their underlying resolutions. This potential to utilise the 
strengths of multiple face recognition systems clearly 
demonstrates the advantage of the compensation frame- 
work. 



Table 2. Performance of three MRH-based systems for 
classifying LFW image pairs with resolution mismatches. 
All images had a fixed size of 64x64, but in each pair the 
second image had the underlying resolution varying from 
8x8 to 64x64 (see Fig. 1). Systems A and B were tuned 
for size A and B, respectively, while the dynamic system 
switched between system A and B according to the classi- 
fication result of the resolution detector. 



Underlying Resolution    System A    System B    Dynamic System
64x64                    74.25 %     70.28 %     74.35 %
32x32                    70.36 %     69.99 %     70.47 %
16x16                    59.35 %     68.08 %     67.62 %
8x8                      53.13 %     59.40 %     59.90 %
Average                  64.27 %     66.94 %     68.09 %



For a given pair of resolution-modified images from 
the LFW dataset, the proposed resolution detector was 
able to classify which face recognition system was the 
optimal one 99% of the time on average. This indicates 
nearly perfect face recognition system selection when 
used in the compensation framework.